A blood and bronchoalveolar lavage protein signature of rapid FEV1 decline in smoking-associated COPD

Accelerated progression of chronic obstructive pulmonary disease (COPD) is associated with increased risks of hospitalization and death. Prognostic insights into mechanisms and markers of progression could facilitate development of disease-modifying therapies. Although individual biomarkers exhibit some predictive value, performance is modest and their univariate nature limits network-level insights. To overcome these limitations and gain insights into early pathways associated with rapid progression, we measured 1305 peripheral blood and 48 bronchoalveolar lavage proteins in individuals with COPD [n = 45, mean initial forced expiratory volume in one second (FEV1) 75.6 ± 17.4% predicted]. We applied a data-driven analysis pipeline, which enabled identification of protein signatures that predicted individuals at-risk for accelerated lung function decline (FEV1 decline ≥ 70 mL/year) ~ 6 years later, with high accuracy. Progression signatures suggested that early dysregulation in elements of the complement cascade is associated with accelerated decline. Our results propose potential biomarkers and early aberrant signaling mechanisms driving rapid progression in COPD.

later, with high accuracy. Progression signatures suggested that early dysregulation in elements of the complement cascade is associated with accelerated decline. Our results propose potential biomarkers and early aberrant signaling mechanisms driving rapid progression in COPD.
Chronic obstructive pulmonary disease (COPD), a leading cause of death in the United States 1 , accounts annually for > 600,000 hospitalizations 2 and $30 billion in direct health expenditures 3 . The course of COPD is heterogeneous. Such heterogeneity is exemplified in part by the highly variable rates of annualized decline in forced expiratory volume in one second (FEV 1 ) observed in prospective observational cohort studies [4][5][6][7] . This variability between rates of lung function decline causes some individuals to experience relatively stable courses, while others experience accelerated loss of function that leads to severe breathlessness and increased risk of hospitalization and death 6 . Because phenotypic diversity is underpinned by biological heterogeneity, there is growing interest in identifying molecular biomarkers to predict multiple aspects of COPD progression. Such molecular markers could help explain heterogeneity and facilitate the early detection of individuals at risk for accelerated lung function decline, enabling personalized management to arrest disease progression. Biomarkers could also direct research into underlying pathogenic mechanisms to uncover novel therapeutic targets.
To date, accelerated FEV 1 decline has been associated with individual blood proteins, including club cell secretory protein 16 (CC16) 8,9 , soluble receptor for advanced glycation end-products (sRAGE) 8 , fibrinogen 8 , C-reactive protein (CRP) 8,10 , and interleukin 6 (IL-6) 11 , although contradictory reports exist 10,[12][13][14] . The ratio of leptin to adiponectin in plasma also demonstrated predictive value for rapid lung function decline but showed only moderate sensitivity (63.5%) and specificity (65.1% ) 15 . While valuable, univariate analyses of candidate biomarkers have provided limited insights into underlying mechanisms of progression, a shortcoming that could be complemented by network-level analysis of large numbers of proteins.
Previous studies have also been limited to measuring biomarkers in a single tissue compartment due to sample availability. However, data indicate that combinations of proteins appear to be better predictors of multiple COPD outcomes (including FEV 1 decline) than individual factors 8 , especially when derived from multiple compartments. Cross-sectional studies integrating datasets across multiple molecular levels and anatomical locations has improved the ability to classify individuals with a smoking history by COPD status and uncovered novel diseaseassociated pathways [16][17][18] . Accordingly, analyses of lung function decline may benefit from evaluating systems-level integrated networks that are more likely to capture the diverse biology driving airflow obstruction 19,20 .
Data-driven modeling is one approach that will enable inference of network-level relationships driving progression. By permitting data integration across multiple tissue compartments, data-driven modeling generates systemic networks ("signatures") of co-varying biological factors associated with disease phenotypes. Identified signatures can be linked to pathogenic mechanisms, providing insight into potential targets for follow-up experiments or as biomarkers for therapeutic intervention. These approaches have been successfully used to identify blood and bronchoalveolar lavage (BAL) protein signatures associated with disease state and progression in idiopathic pulmonary fibrosis 21,22 . We have also used them to integrate blood and sputum protein signatures associated with COPD exacerbation 23 .
Here, to gain insights into cross-compartment mechanisms associated with a greater lung function decline in COPD, we applied an integrative data-driven modeling pipeline to proteins recovered from matched blood and BAL samples from participants in the bronchoscopy sub-study 24,25 of the SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS) 26 . Our results suggest that proteomic signatures can effectively detect individuals at increased risk of accelerated lung function decline. They also provide insights into COPD progression mechanisms that can be further investigated in validation cohorts and follow-up murine studies.

Results
Participant characteristics. We evaluated participants of the SPIROMICS bronchoscopy sub-study who had COPD, paired baseline (Visit 1/ V1) and final (Visit 5/ V5) spirometry, plus matched proteomic measurements from plasma samples and BAL samples (n = 45) ( Supplementary Fig. S1). Participants' mean (± SD) age at V1 was 63 ± 7.7 years; they had a follow-up time of 6.3 ± 0.9 years (Table 1). To characterize rapid progression, we dichotomized participants based on their annualized FEV 1 decline (∆FEV 1 ): greater decliners (< 30th percentile) (n = 14) versus lesser decliners (≥ 30th percentile) (n = 31) (Fig. 1a). This threshold equaled a ∆FEV 1 of -70 mL/ year. Greater decliners trended non-significantly to be male (85.7% vs. 54.8%) but were well-matched for other demographic criteria. Despite significantly higher FEV 1 % predicted (p = 0.017) and absolute FEV 1 (p = 0.015) at V1, they experienced a 3.6-fold greater ∆FEV 1 compared to their lesser decliner counterparts (− 104.6 ± 32.0 vs. − 28.8 ± 21.5 mL/year). The observed association between faster decline and higher baseline lung function is in line with previous reports 4,27 . Individual blood and BAL proteins cannot discriminate between rates of longitudinal lung function decline. We first determined whether differences existed between greater decliners and lesser decliners in multi-compartment protein expression measured early in the study, using the concentrations of 1305 blood and 25 BAL proteins measured with SOMAScan and Luminex technology, respectively. To reduce biases associated with the unequal distribution of women across classes (14.3% vs. 45.2%), we first removed proteins (n = 8) that exhibited significant associations with sex ( Supplementary Fig. S2). Across the remaining 1322 proteins, 28 (2.1%) had a mean concentration that differed significantly (p < 0.05) between groups (Fig. 1b). Of these, 13 were increased in greater decliners [log 2 fold change (FC) > 0] and 15 were increased in lesser decliners (FC < 0). The top six most significantly different proteins (p < 0.01) were all identified in blood: Heparin-binding EGF-like growth factor (HBEGF; FC = 0.05), BCL2-related protein A1 (BCL2A1; FC = 0.03), Data-driven modeling identifies a proteomic multi-compartment signature that prospectively differentiates progression phenotypes. We next used data-driven modeling approaches to identify a multi-compartment signature of co-varying proteins associated with greater FEV 1 decline. Here, we combined all 1322 protein measurements (1297 blood and 25 BAL) into a single dataset, then applied elastic net (EN) in tandem with partial-least squares discriminant analysis (PLSDA). First, EN regularization was applied iteratively to 2000 subsets of randomly resampled data. Then, based on their selection frequency throughout the iterations, proteins were ranked (most to least frequent) and fed stepwise into the PLSDA algorithm. We evaluated PLSDA model performance at each step using sixfold cross-validation (CV) and selected the model with the highest CV accuracy as the optimal signature.
Using this feature selection pipeline, we identified a signature of 52 proteins (51 blood and 1 BAL) that distinguished greater decliners (FEV 1 decline ≥ 70 mL/year), along the latent variable 1 (LV1) axis, with 98.4% calibration and CV accuracy, 100% sensitivity, and 96.8% specificity (Fig. 2a-c). Permutation tests performed on participant scores across the first two principal components of PCA models generated with the 52-feature signature, show no significant influence of baseline ICS use (p = 0.35) or smoking status (p = 0.57) on participant classification ( Supplementary Fig. S3). To confirm the accuracy of the selected features, we compared its crossvalidated accuracy to 1000 random signatures, generated by selecting iterative groups of 52 random proteins from the original dataset. None of the random signatures outperformed the optimal model (p < 0.0001) (Supplementary Fig. S4). We also observed a significant, albeit moderate, Pearson correlation between LV1 scores and plasma concentrations of CRP (r p = 0.33, p = 0.03), a protein previously associated with FEV 1 decline 8,10 . However, LV1 scores exhibited no significant correlations with other reported blood markers of spirometric decline, including IL-6 11 (r p = 0.18, p = 0.29), fibrinogen 8 (r p = − 0.01, p = 0.93), and matrix metalloproteinase 9 (MMP-9) 10 (r p = − 0.27, p = 0.08).
Lack of a formal definition of "rapid progression" in COPD has led to literature variability, so we explored whether the identified 52-feature signature maintained significance across alternative characterization approaches. Recently, using the entire SPIROMICS dataset, Anderson et al. proposed a threshold-based definition, classifying progression into three groups based on annualized FEV 1 declines: rapid decliners (> 100 mL/ year), decliners (20-100 mL/year), and stable/ improvers (< 20 mL/year) 28 . The limited number of subjects in our bronchoscopy sub-study exhibiting such extreme decline hindered direct exploration of this definition. However, an exploratory PCA using the 52-feature multi-compartment signature demonstrated significant signature enrichment in rapid decliners thus defined ( Supplementary Fig. S5), suggesting our model reliably extends to this more rigorous definition.
A regression-based analysis also found that participant scores on LV1 correlated highly with annualized declines in FEV 1 (r p = 0.758, p < 0.0001), even after adjustment for age, race, height, sex, baseline FEV 1 % predicted, smoking status, pack-years, and ICS use (p < 0.0001) (Fig. 2d). Alternative estimations of FEV 1 decline using all available longitudinal spirometry (rather than just V1 and V5) produced similar results, as did classifications using FEV 1 % predicted in lieu of absolute FEV 1 volumes ( Supplementary Fig. S6). These results suggest that our approach captures complex progression trends and is largely not skewed by demographic factors influencing lung capacity. For the remainder of our analysis, we use the -70 mL/year definition. Table 1. Baseline characteristics of COPD cases. Two-sample, two-tailed t-test or Fisher's exact test were used to determine significant differences. *Demographic information from baseline visit (Visit 1). † P-values are associated with differences between greater decliner and lesser decliner groups. ‡ Decline in FEV 1 (mL/ yr) ≥ 70 mL/year (see Methods). www.nature.com/scientificreports/ To contextualize the performance of the 52-feature multi-compartment signature, we compared it to crossvalidated analyses based on proteins identified in the univariate analysis ( Fig. 1b) and the literature. Unsurprisingly, the 52-feature signature performed significantly better than independent PLSDA models generated with each of the top six most differentially expressed proteins ( Supplementary Fig. S7) and a combinatorial model generated using all six (Fig. 2e). The signature also significantly outperformed cross-validated models built with published combinations of proteins associated with spirometric decline (sensitivity, specificity: sRAGE and Fibrinogen, 57.1%, 58.0%; Leptin: Adiponectin ratio, 63.5%, 65.1%) (Fig. 2f). To explore how the identified signature characterize controls with relation to COPD subjects, we generated an additional PCA with the 52 signature-identified proteins and a reference groups of tobacco-exposed people with preserved spirometry (TEPPS) (n = 38) who have a history of smoking but no airflow obstruction (FEV 1 /FVC > 0.7) (Supplementary Table S1). Using a permutation test with participants' scores across the first two principal components, we show that TEPPS responses were significantly different from Greater Decliners (p < 0.0001) but not from Lesser Decliners, suggesting our signature can uniquely characterize smoking controls from Greater Decliners alone ( Supplementary Fig. S8).
The progression signature is enriched for proteins involved in the complement system. Having generated a high-performing progression signature, we sought to understand the biological implications of its components. Unsupervised hierarchical clustering identified greater decliners with 88% accuracy (Fig. 3a). A Metascape analysis found 20 significantly enriched ontology clusters (Fig. 3b), with only three, related to aging and phosphorylation-dependent signal transduction, shared between groups. Lesser decliners displayed unique enrichment of 16 clusters related primarily to inflammation and immune functions. Notably, the only uniquely enriched cluster in greater decliners was associated with the complement system (q = 4.10e-04) ( Fig. 3c-d). Proteins in this cluster included blood albumin, bone sialoprotein 2 (IBSP), intracellular adhesion molecule 1 (ICAM1), interferon-gamma (IFN-γ), kynureninase (KYNU), and three complement proteins (iC3b, C3d, Properdin). These three complement proteins were involved in 10 of 20 total clusters and represented three of the top eight loaded proteins in the PLSDA, indicating a potentially significant impact of complement processes in COPD progression.
Dysregulated complement protein signatures are associated with accelerated FEV 1 decline. These enrichment data, coupled with the known importance of the complement cascade to immunity, suggest that complement proteins may be more globally altered in greater decliners than was captured in the original signature. To explore this possibility and focus on complement dysregulation, we performed PCA using the concentration of 22 complement proteins measured in plasma. Results suggested that progression groups differed significantly in complement profiles as measured by scores across PC1 (p = 0.0045) ( Supplementary  Fig. S9). We selected PC1 scores as the PC of interest, as only they correlated significantly with annualized FEV 1 decline (p < 0.001) ( Supplementary Fig. S10a). To explore whether the observed patterns were specific to the greater decliners rather than COPD more generally, we extended our analysis to include the reference group of TEPPS (n = 38). Interestingly, the baseline profiles of TEPPS were similar to that of lesser decliners. In contrast,  www.nature.com/scientificreports/ complement profiles from greater decliners differed from those of lesser decliners and TEPPS, both as visualized using PCA (Fig. 4a) and via direct comparison of PC1 scores (Fig. 4b). The variance in complement profiles, as captured by PC1, correlated with FEV 1 decline (p = 0.004) ( Supplementary Fig. S10b), a relationship that remained significant after adjustment for age, race, height, sex, FEV 1 % predicted, smoking status, pack-years, and ICS use (p = 0.012) ( Supplementary Fig. S10c). In univariate comparisons, among the 22 complement proteins, only C1r, iC3b, C3d, Properdin, and C4 reached statistical significance ( Supplementary Fig. S11). Performing permutation tests with participant scores across the first two principal components labeled by key clinical variables, we show no significant influence of baseline ICS use (p = 0.36) or current smoking status (p = 0. 70) on observed complement profiles ( Supplementary Fig. S12). Collectively, these findings reinforce the Metascape results, suggesting that early patterns of complement dysregulation are specific to a more rapidly progressing phenotype.
Alternative minimal signatures highlight a small number of proteins that maintain high predictive power. Although the large number of proteins in the differentiating signature proved advantageous in exploring functional enrichments associated with accelerated FEV 1 decline, it is too large to be used as a pre-  www.nature.com/scientificreports/ diction tool. Therefore, we next explored whether smaller sub-models ("minimal signatures") maintained predictive value. Using insights from our step forward PLSDA pipeline, we visually analyzed the trade-off between model size and performance. Two minimal signatures, with 6 and 11 features, respectively, had CV accuracies similar to our optimized model (CV accuracy: 6-feature, 81.6%; 11-feature, 88.4%; 52-feature, 98.4%) (Supplementary Fig. S13a). The 6-feature model consisted of BAL Eotaxin and blood iC3b, Leptin, Cadherin-2, HBEGF, and TDGF1. The 11-feature model comprised those six, plus blood-derived FCGR2B, IFN-γ, CA10, Apolipoprotein L1, and LYVE1. ROC curves created for the 11-and 6-feature models resulted in areas under the curve (AUC) of roughly 0.982 and 0.947 for the calibration models and 0.935 and 0.878 for the cross-validated models, respectively ( Fig. 5a; Supplementary Fig. S14). Intriguingly, each minimal signature included at least one protein from both plasma and BAL. We tested proteins from both blood and alveolar compartments to maximize the chances of understanding underlying biology in COPD, specifically the link between distal airways and systemic events. However, due to the invasiveness of bronchoscopy, biomarkers obtained solely from the blood would be preferable. Accordingly, we evaluated a minimal blood signature by applying our step forward EN/PLSDA algorithm exclusively to the 1297 blood proteins measured in the same COPD participants (n = 45). Analysis identified 5-and 10-feature signatures that maintained strong cross-validated performance (CV accuracy: 5-feature, 81.6%; 10-feature, 86.8%) ( Supplementary Fig. S13b). The 5-feature model included blood-derived iC3b, Cadherin-2, Leptin, HBEGF, and TDGF1; the 10-feature model comprised those five, plus CA10, IFN-γ, FCGR2B, LYVE1, and Apolipoprotein L1. Both minimal blood signatures displayed a slight drop in performance compared to their multi-compartment counterparts of nearest sizes (10-and 5-feature signatures, AUC 0.947, 0.901 for calibration models and 0.912, 0.862 for CV models, respectively). However, comparisons using both AUCs and sixfold cross-validation found no significant difference in the performance of any of the minimal signatures and the optimal 52-feature model ( Fig. 5a; Supplementary Fig. S15).
Finally, we subjected the minimal signatures to cross-validated analyses relative to multivariate signatures identified through our univariate analysis or published literature. Comparisons using sixfold CV accuracy indicated that overall performance was largely sustained. The 6-and 11-feature multi-compartment signatures and the 10-feature blood signature significantly outperformed literature-based models and trended towards outperforming a signature based on the top 6 univariate proteins (Fig. 5b-c). The 5-feature blood signature did not reach statistical significance in any comparison, though its performance over literature-based models was substantially improved. All four signatures showed a > 20% increase in sensitivity, with lesser though considerably improved specificity versus other models (Fig. 5d-e). Collectively, these findings imply that small, proteomic signatures can differentiate individuals with tobacco smoking-associated COPD at risk for rapid lung function decline with high accuracy.

Discussion
This study used two complementary datasets from COPD participants in the SPIROMICS bronchoscopy substudy to generate insights into early, aberrant signaling mechanisms associated with accelerated lung function decline. Using a systems analysis, we identified a multivariate signature of early blood and BAL proteins that www.nature.com/scientificreports/ To our knowledge, this is the first longitudinal study of COPD progression to use integrated proteomic datasets derived from blood and BAL samples. Our tandem EN and PLSDA approach identified concise proteomic signatures from thousands of proteins measured across multiple tissue compartments. This framework accurately differentiated individuals with COPD who sustained declines in lung function ≥ 70 mL/year based on proteins measured early after participant enrollment. We specifically sought to explore integrated lung and systemic compartment models because COPD has coupled local and peripheral manifestations 29 . Although several proteins were measured in lung and blood tissue compartments, classifying signatures did not select any matched BAL/blood proteins, which may be partly due to differences in measurement platforms. Moreover, the fact that only one BAL protein was identified in the 52-feature signature is likely in part a consequence of the marked difference in numbers of analytes in blood and BAL (1297 vs. 25). Currently, the SOMAmer technology used in our plasma analyses has not been validated for use in BAL samples. However, the importance of multicompartment representation is exemplified by the over 15% improvement in calibration model sensitivity on adding BAL Eotaxin to the 5-feature blood signature.
Our multi-compartment 52-protein signature was enriched for inflammation and immune responses processes, consistent with the central role of immune dysregulation in COPD pathogenesis [30][31][32] . Chief among these processes was the complement system. By providing evidence that complement alterations precede accelerated FEV 1 decline, we extend previous associations between the levels of blood-derived complement proteins (C3 33,34 , C4 35 , C4b 36 , C5a 37 , C9 38 , Factor B 36 ) and COPD status (case vs. control), cross-sectional analyses of FEV 1 % predicted 39,40 , and emphysema severity 41 . How such aberrations might contribute to airflow limitation is unknown. In a murine model, C3 cleavage contributed to smoking-induced emphysema via an influx of conventional dendritic cells 42 , a cell type that can initiate both innate and adaptive immune responses. Our findings regarding involvement of complement proteins are specific to blood. Although our chosen assay system could not analyze complement components in BAL, complement dysregulation might extend into the lung, as suggested by altered levels of C5a in the sputum in COPD 37,43 and airway C3 deposits in lungs in smoking-associated emphysema 42 . Airway epithelial cells have also been shown to secrete and store C3, suggesting the presence of a localized C3 supply that may aid host defense 44 . The observed global patterns of complement dysregulation were robustly associated with FEV 1 decline, but SomaLogic aptamers cannot reliably distinguish between complement cleavage products and their parent proteins. Hence, we cannot explore the relationship between global complement levels and pathway activation. However, our observation of complement dysregulation before FEV 1 decline provides compelling temporal evidence that the complement pathway contributes to accelerated progression.
Focusing on eventual clinical feasibility, we identified an alternative parsimonious 10-protein signature derived using only peripheral blood; its potential prognostic value, if validated, is suggested by its superior performance to reported multivariate biomarkers of FEV 1 decline 8,15 . To our knowledge, except for leptin 15 , these proteins have not been associated with accelerated loss of lung function. Consistent with previous studies [8][9][10][11] , the proteins in this signature are primarily related to immune and inflammation responses (leptin, iC3b, IFN-γ, FCGR2B, APOL1). However, we also observed notable contributions from endothelial-mesenchymal transition (EMT) proteins (Cadherin-2 and HBEGF). EMT is active in both large and small airways of COPD and relates to airflow obstruction [45][46][47] . Cadherin-2 is increased in epithelial cells from COPD subjects as compared to healthy controls 48 . Similarly, serum and sputum HBEGF levels have been positively associated with COPD severity measures, including FEV 1 % predicited 49 and CAT score 50 . The final proteins involved in the signature (TDGF1, LYVE1, CA10) have not previously been associated with COPD. However, in lung cancer, which is thought to share overlapping etiologic features, TDGF1 (increased in greater decliners) predicts poor progression-free survival 51 , while LYVE1 (increased in lesser decliners) is associated with reduced metastasis and mortality 52 . Even in this parsimonious signature, we observed diverse biological enrichment. These findings emphasize the value of multivariate signatures in evaluating heterogeneous conditions such as COPD, where abnormalities in several pathways or pathway constituents likely drive a singular clinical outcome.
Our study has limitations. Chief among these is the lack of a validation cohort as, to our knowledge, none currently exists with both BAL protein measurements and longitudinal follow-up. Missing data also limited our sample size. BAL was not collected successfully on all participants, chiefly due to airway collapse during the procedures, and not all participants completed V5. Hence, we have relatively small numbers of participants, disproportionately non-Hispanic whites. To mitigate the influence of individual participants, we applied an iterative bootstrapping framework during model generation. Still, the generalizability of our findings remains unclear, given the issues of limited heterogeneity based on sample size and demographics. We recognize that we cannot definitively conclude whether the identified signatures precede lung function decline, as decline may have been ongoing prior to baseline sample acquisition. However, the modestly higher baseline FEV 1 observed in greater decliners support this possibility. Additionally, the cross-sectional nature of our proteomic data limits any insight into the temporal stability of our identified signatures. Because there is no universally agreed-upon definition of rapid progression, we used a percentile cut-off in FEV 1 decline, as used by others 15,53 ; resulting in a cut-off (≥ 70 mL/year) similar to previous reports. However, other threshold-based definitions (i.e., FEV 1 decline > 100 mL/year) have been proposed 28 . Few participants (n = 6) experienced declines > 100 mL/year in our data, providing insufficient power to investigate this definition accurately. However, in an exploratory PCA, we show that our 52-feature multi-compartment is enriched significantly in participants with declines > 100 mL/ year, suggesting our signature extends to this more stringent characterization. Moreover, while we acknowledge that a fixed cut-off definition may favor an overrepresentation of males as greater decliners due to physiologic differences in lung function measurements between sexes, we are underpowered to explore sex-stratified analyses. Still, alternative estimations of FEV 1 decline using FEV 1 % predicted, which accounts for age, sex, and body composition, in lieu of absolute FEV 1 volumes, produced similar results, suggesting a minimal impact of sex on group affiliation in this study. Lastly, a longitudinal decline in FEV 1 is only one parameter that can evaluate progression and is less sensitive to capturing changes in small airway loss than other clinical measures, like www.nature.com/scientificreports/ parametric response mapping. However, it is worth noting that elements of our signature may reflect potential markers of small airway damage, as our signature exhibits enrichment of several processes that contribute to small airway damage, such as the response to wounding and ECM organization 54 . Nonetheless, future studies are needed to explore the relevance of the identified signature in assessing other outcomes. In summary, data-driven modeling approaches identified early cross-tissue compartment proteomic signatures and provided insight into potential mechanisms associated with accelerated disease progression in COPD. This work highlights the ability of quantitative, systems-focused analytical techniques to accomplish both these goals. Data-driven modeling approaches could be applied to integrate spatiotemporal data in clinical samples from other diseases with a progressive or heterogeneous population.

Methods
Human participants. SPIROMICS (ClinicalTrials.gov Identifier: NCT01969344) is an ongoing multicenter, prospective observational study designed to identify new COPD subgroups and intermediate biomarkers of disease progression 26 . Briefly, we enrolled participants aged 40-80 years at entry with a history of cigarette smoking (≥ 20 pack-years), either with COPD by the fixed ratio definition (post-bronchodilator FEV 1 /FVC < 0.7), or without COPD; as controls, we recruited healthy individuals without smoking history. SPIROMICS participants (n = 2,974) underwent a baseline examination (V1) followed by yearly visits for up to three years and a final follow-up visit (V5) approximately 5-8 years after V1. The first participant entered on November 10, 2010, and we censored all data on July 31, 2021. The study was conducted according to the principles of the Declaration of Helsinki. The human study protocol was approved by the institutional review board of all participating centers and methods were carried out in accordance with the relevant guidelines and regulations ( Some SPIROMICS subjects (n = 215) from all groups except those with severe (GOLD 4) COPD participated in a bronchoscopic sub-study 24,25 , which included BAL of the right middle lobe and lingula. Participants (n = 149) with a history of smoking who had available blood and BAL samples were considered for inclusion in the initial Elastic Net analysis ( Supplementary Fig. S1). This analysis was restricted to participants (n = 85) of that sub-study who had a history of smoking, available spirometry that did not improve at V5, and full biospecimens, which were plasma samples at V1 and BAL cytokine analysis; their baseline characteristics are shown in Table 1. They comprised two study groups: COPD cases (n = 45) and a reference group (n = 40) of TEPPS who had no airflow obstruction at both V1 and V5 (Supplementary Table S1). The demographics of our study participants (n = 85) did not differ significantly from the entire SPIROMICS bronchoscopy cohort (Supplementary Table S2).
Based on the magnitude of annual change in FEV 1 , we dichotomized the COPD cases into greater decliners (< 30th percentile; n = 14) versus lesser decliners (≥ 30th percentile; n = 31). FEV 1 decline was calculated using the two-point slope equation: [V5 FEV 1 -V1 FEV 1 ]/time, where time is the duration, in years, from V1 to V5 for each subject. Time calculations assumed a fixed-length year equal to 365.2425 days.

Sample preparation & datasets.
Blood dataset: Fresh plasma samples collected at V1 were frozen in either an EDTA collection tube or a P100 tube with K2EDTA 55,56 . SOMAmer© (slow off-rate modified aptamer) technology 57 (SomaLogic, Boulder, CO) was used to measure 1305 proteins from participants in the SPIRO-MICS bronchoscopy sub-study.
BAL dataset: We measured the concentration of 48 proteins including cytokines, chemokines, and growth factors (HCYTA-60 K-PX48, Milliplex, EMD Millipore Corporation) in BAL aliquots from a subset of participants in the SPIROMICS bronchoscopy sub-study (n = 184) using Luminex FlexMAP 3D (Luminex Corporation, Austin, TX) technology. Any results above the upper limit of detection were set to the maximum detectable concentration of that analyte. We set samples below the lower limit of detection to be half the lowest minimum detectable concentration across the standard curves of all analytes. We removed the 23 proteins in which ≥ 50% of measurements were below the lower limit of detection across all samples, yielding 25 analyzable BAL proteins. Before analysis, we normalized all BAL protein concentrations to the total BAL protein concentration of the respective sample, as quantified by a Pierce BCA Protein Assay Kit (Pierce Protein Biology, Rockford, IL).
Multi-compartment dataset: We removed eight proteins that were associated with sex in standard two-tailed, two-sample t-test after correction for multiple comparisons using Benjamini-Hochberg. The final dataset consisted of 1322 proteins (1297 blood and 25 BAL). All analytes were log-transformed for normality before analysis.
Derivation of data-driven progression signature(s). Relative fold-changes in the expression levels of individual proteins from the blood and BAL were calculated by dividing the average concentration of each protein in COPD greater decliners by the average concentration in lesser decliners.
Based on proteomic measurements from the COPD participants, we generated optimal progression signatures using EN in tandem with PLSDA for feature selection in the: (a) combined blood and BAL and (b) blood-only datasets. First, the data were randomly sampled without replacement to generate 2000 subsets. To correct for effects of class size imbalances during regularization, we completed resampling at the size of the smallest class. We then performed EN regularization on each of the 2000 subsets. Once regularization was complete, the proteins Bioinformatic analysis. Clustering: Hierarchical clustering of the 52-feature signature based on blood and BAL proteins was generated with supervised average linkage clustering using Spearman's correlation coefficient as the distance metric. Samples were colored by progression status.
Metascape analysis: Metascape 59 [https:// metas cape. org] was used to identify biological processes that were significant and differentially enriched between greater decliners and lesser decliners based on the identified 52-feature signature. PLSDA loadings on LV1 were used to dichotomize proteins between cohorts, such that proteins with positive or negative loadings were increased in lesser decliners or greater decliners, respectively.